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Setting initial values of parameters of mixture distributions estimated by using the EM 
recursive algorithm is very important to the overall quality of estimation. None of the 
existing methods is suitable for mixtures with large number of components. We present 
a relevant methodology of estimating initial values of parameters of univariate, het- 
eroscedastic Gaussian mixtures, on the basis of the dynamic programming algorithm 
for partitioning the range of observations into bins. We evaluate variants of dynamic 
programming method corresponding to different scoring functions for partitioning. For 
simulated and real datasets we demonstrate superior efficiency of the proposed method 
compared to existing techniques. 
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1. Introduction 

Due to the importance of Gaussian distribution a significant part of the research on 
improving performance of the expectation maximization (EM) recursive algorithm 
[McLachlan and Peel (2000)] for estimation of parameters of mixture distributions 
is focused on estimating parameters of mixtures of Gaussian components. A prob¬ 
lem of crucial importance is the choice of initial values for mixture parameters. 
A good choice of a starting point for the EM iterations can result in reducing the 
probability of erroneous estimation of parameters and/or in (faster) convergence of 
EM iterations. Approaches to initializing EM iterations were extensively discussed 
and studied [McLachlan and Peel (2000); Karlis and Xekalaki (2003); Biernacki 
et al. (2003); Biernacki (2004); Maitra (2009); O’Hagan et al. (2012); Melnykov and 
Melnykov (2012)]. A simple approach is random initialization involving generation 
of initial values of parameters and component weights (mixing proportions) by us¬ 
ing some assumed probability distributions [McLachlan and Peel (2000)]. Another 
simple idea is using data quantilles to estimate initial means and variances of com¬ 
ponents to start EM iterations. A group of approaches involve using some kind of 
clustering procedure (hierarchical clustering or k-means clustering) applied for the 
data set to compute initial parameters for EM iterations [Biernacki et al. (2003); 
Maitra (2009); O’Hagan et al. (2012)]. Some other ideas for computing initial val¬ 
ues involve using sample moments of data [Karlis and Xekalaki (2003)], method of 
using singular value decompositions for multivariate mixtures [Melnykov and Mel¬ 
nykov (2012)], using properties of EM trajectories and data moments to limit the 
search space for initial parameters of multivariable Gaussian mixtures [Biernacki 
(2004)]. Available software packages for mixture modeling [McLachlan and Peel 
(1999); Biernacki et al. (2006); Fraley and Raftery (1999); Richardson and Green 
(1997)] offer different possibilities for setting initial conditions for EM iterations. 

A challenge for the existing methodologies is decomposition of univariate, het- 
eroscedastic Gaussian mixtures with large number of components. It can be verified 
in practical computations that with the increase of the number of components in 
the mixture, the application of the EM algorithm with the published methods for 
setting initial conditions would lead to the mixture models of the progresively worse 
quality. Yet, problems of mixture decompositions of univariate models where the 
numbers of components are large are encountered in many applications. Some ex¬ 
amples are given hereafter. Frequency (amplitude or power) spectra of different time 
domain signals may contain numerous compenents. E.g., in the vibration diagnos¬ 
tics or speech recognition applications frequency spectra of measurement signals 
can be analyzed as mixtures of tens of Gaussian functions [Fraiha Machado et al. 
(2013)]. In nuclear magnetic resonance (NMR) spectroscopy free induction decay 
(FID) signal contains components corresponding to molecules present in the sample, 
associated with different frequencies. Frequency spectrum of such signal contains 
tens of components. Moreover, spectral signatures of some metabolites can exhibit 
complex shapes, which may require including even more components for their mod- 
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eling [Chylla (2012)]. In some applications concerning modeling, interpolation and 
forecasting of time signals mixtures of Gaussian functions are applied, which include 
tens of components [Eirola and Lendasse (2013)]. Time of flight mass spectrometry, 
a high-throughput, experimental technique in molecular biology provides measure¬ 
ments of peptide, protein or lipid compositions of biological samples. Mass spectral 
signals can be modeled by mixture decompositions, where numbers of components 
can reach even several hundreds [Pietrowska et al. (2011); Polanski et al. (2015)]. 

The aim of this paper is to develop and evaluate the method for estimating 
initial values of parameters for EM iterations, for univariate, multi-component, het- 
eroscedastic Gaussian mixtures, based on the dynamic programming partitioning 
algorithm. Partitioning the data points into bins, by dynamic programming, deter¬ 
mines initial values of weights, means and variances of Gaussian components for the 
EM algorithm. The idea of partitioning an interval into bins, by using dynamic pro¬ 
gramming, with the aim of obtaining solution to some fitting or estimation problem 
was formulated by Bellman [Bellman (1961)] and then studied/used by other au¬ 
thors in many contexts [e.g., Hebrail et al. (2010)]. The advantage of using dynamic 
programming over partitioning (clustering) heuristics by hierarchical clustering or 
greedy algorithms is that the dynamic programming method allows for obtaining 
global optimal solution for a given quality function. According to authors’ best 
knowledge, dynamic programming partitions were so far not used for initalization 
of EM iterations. 

We present the dynamic programming algorithm for computing initial values 
for mixture parameters for EM iterations. We also study the problem of the choice 
of the scoring function. We compare several possible scoring functions and, on the 
basis of computational experiments, we propose the one best suited to the prob¬ 
lem of computing initial values of mixture parameters. For a number of artificially 
generated univariate data sets, with multiple Gaussian, heteroscedastic components 
we compare some other methodologies of initialization of EM iterations with the 
dynamic programming method and we show its advantage. In the areas of applica¬ 
tions of mixture decompositions listed afore, measured signals are often numbers of 
counts or intensities. In the context of the mixture decompositions of such datasets 
we describe the version of EM iterations appropriate for binned data and we develop 
an approach for the dynamic programming partition for computing initial condi¬ 
tions. We apply the dynamic programming method for initializing EM iterations to 
the real dataset coming from protein mass spectrometry experiment and we again 
demonstrate the efficiency of dynamic programming method for initialization of the 
EM iterations. 

2. Gaussian mixture modeling 

Gaussian mixture modeling of the dataset of univariate scalar observations, x = 
[x\,X 2 ,..., Xj\f] involves estimating the vector of mixture parameters 

P = [ou,..., oik , /ii, ■■■, Vk,<t i, —,0k\ 
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(component weights, means and standard deviations) of the mixture probability 
density function (pdf) 

K 

f mzx (x,p) = '^2a k f(x,/j, k ,a k ), ( 1 ) 

1 

where /(#, /i&, cr^) is the Gaussian pdf and a k = 1, such that the log-likelihood 

function 


N 

L (x, p) = lc) g ( X n, P) 

n—1 


( 2 ) 


is maximized [McLachlan and Peel (2000)]. 

Analyzes of datasets where measurements are numbers of counts or intensities 
involve formulations of mixture model decompositions for binned data [McLachlan 
and Peel (2000)], where x is a vector of equally spaced centers of bins and observa¬ 
tions are given by a vector y = [ 3 / 1 , 2 / 2 > of counts y n , n = 1...N generated by 

multinomial distribution with probabilities p n defined by areas of bins, 


K 

Pn = ^ \ 
fc =1 



2 ,l*k,(7k) ~ &{x n 


2 ’ 




( 3 ) 


In the above <P{x, p k ,o k ) is the cumulative probability distribution function of the 
Gaussian distribution and 6 is a bin width. We assume that bins are ’’dense”, which 
allows for approximating the area of the bin by the product of its width and the 
probability density function at the bin center, and consequently for using the log- 
likelihood function defined by the following formula 

N 

L (x, p) = ^2 y„ log f mix (x n , p ). (4) 

n—1 


2.1. EM iterations 


Maximization of the log-likelihood functions (2) and (4) is done by using EM iter¬ 
ations - the successive updates of parameter vector p old p new : where 


^old told, „old ..old ..old _old ^old~\ 

P , •••, &K f P'1 > •••) P'K J) 


, old 


_old ..old 


old _old 


old i 


and 


^new r new _ new ..new ..new „new „new i 

p = [oq ,p 1 ,...,p K ,<J 1 ,...,<t k J. 

Standard formulae for EM iterations are e.g., [McLachlan and Peel (2000); Bilmes 
(1998)]. The version of EM iterations appropriate for binned data with dense bins 
is similar to standard EM iterations. It involves defining conditional distributions 
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of hidden variables, Xn (corresponding to unknown assignments of observations to 
components) 


P \Xn = k \x n ] = 


r^old f ..old rrOldX 

a k J\' L n,H>k i a k ) 

K 


E ryOld 4~ (ry‘ I,old rrOld\ 


and updates of parameters estimates given by 


N 


_ new n—1 
a k ~ 


E VnP[Xn = k \x n ] 


N 

E Vn 

n—1 


(5) 


( 6 ) 


N 

E ynX n P[Xn = k \x n ] 

new _ n—1 _ 

Vk — N 

E VnP[Xn =k\x n ] 

n=l 


(7) 


N 

E y n {x n - VT W ?P\Xn = k \x n ) 

{aT W ? = — -^-, ( 8 ) 

E ynP[Xn = k \x n ] 

n—1 

k = 1,2,. ..K. 

2.2. Preventing divergence of EM iterations 

Some assumptions should be made concerning execution of the EM iterations. In 
the case of unequal variances of components of the Gaussian mixture, considered 
here, the log-likelihood (2) or (4) is unbounded [Kiefer and Wolfowitz (1956)]. Un¬ 
boundedness results in a possibility of encountering divergence of EM iterations in 
practical computations and in the need for using approaches for preventing diver¬ 
gence [Yao (2010); Ingrassia (2004)]. Here we prevent divergence of EM iterations 
by a simple constraint conditions. Namely we do not allow standard deviations of 
Gaussian components and mixing proportions to fall below given threshold values 
Cmin and a m ; n i.e., we augment equations for iterations for standard deviations and 
component weights by additional constraints 

a k ew t— max(u(] eu ', cr min ) (9) 

and 

a k €W t— max(a“"',Q min ). (10) 

The above constraints are sufficient to prevent divergence of EM iterations. 

The constraints values assumed in EM iterations are cr m i n = 10 _2 ,a m i n = 10~ 4 
for the simulated datasets and <r m i n = l,a m i n = 10~ 5 for the proteomic dataset. 
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3. Problem formulation 


The problem studied in this paper concerns determining initial values for mixture 
parameters for EM iterations, /xjL nl , cr£ nl , aj( nl , k = 1,2for the best quality 
of the mixture parameters estimates. All methods for setting initial conditions for 
EM iterations, studied and compared here, rely on partitions of the data range 
performed according to some criterion. We assume that observations x±, x 2 , •••, xn 
are sorted in the ascending order 

Xi < X2 < ... < Xn■ ( 11 ) 


Partitions are defined by blocks B,, B 2l ..., B^, fc-th block contains samples i,i + 
1 ,-,j, B k = {i,i + l,-.,j}. 

Partitions defined by blocks are used for computing initial values for parameters. 
Initial means are computed as 


/4 nl = 7 


n—j 

\ ' 




initial values for standard deviations are computed as 


( 12 ) 


n=j 


\j-i + l^ Xn A ^ 1)2 

\ n—i 


and initial values for mixing proportions are computed by 

„ ini j — i + 1 


N 


N 


(13) 


(14) 


In the above expression #B k denotes the number of measurements in the block B k . 

For the case of the binned data the appropriate expressions for initial values 
of parameters, implied by partitions given by blocks B k = {i,i + 1,..., j}, are as 
follows. Initial values for means are computed as 


n=j 

^ ) x n w ni 

n—i 

initial values for standard deviations are computed as 


(15) 


^.ini 

°k = 


n—j 


A I y ] w n(x n ^t); 111 ) 2 , 

^ n=i 

and initial values for component weights are computed as 

v 

ini _ 

“ 77jv ' 

Z^ n—l y n 

In expressions (15)-(17) by w n we denote 

Vn 

W n = -:-. 

L~iv=i iJv 


(16) 


(17) 


( 18 ) 
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4. Initializing EM iterations by using dynamic programming 
partitioning algorithm 

In this section we describe the algorithm for computing partitions by using the 
dynamic programming method. Partitioning of the observations involves defining 
K blocks 

B\,B 2 ,...,B K (19) 

where (as already stated) each of the blocks is a range of successive indexes of 
observations 


B k = + 


( 20 ) 


For each of the blocks (more precisely, for data in the block) we compute a scoring 
function, denoted either by Q(B k ) or by Q(xi,Xi+ 1 , 

Q{Bk^) = ***5 Xj'), (21) 


The problem of optimal partitioning involves defining blocks (19), such that the 
cumulative scoring index Q(B\, B 2 , ..., Bk) is minimized, 

K 

Q(B 1 ,B 2 , Bk) = ^2 Q( B k) -> min . ( 22 ) 

k =1 

The solution to the optimal partitioning problem (22) by dynamic program¬ 
ming [Bellman (1961)] is obtained by iterative application of the following Bellman 
equation 


Q° 1 p j {k+1)= min Q° p i _ 1 (/c) + Q(x t , x i+ u •••, Xj), 


(23) 


where Q° p \(k) denotes the optimal cumulative partial score of partitioning the range 
1 ...i into k blocks. 


4.1. Scoring functions 

The scoring function Q(Bk) used in the dynamic programming algorithm of data 
partition (22)-(23) should be designed in such a way that it allows for detection of 
the dense subgroups in the data. A scoring function often used in the literature is 
the weighted sum of squares of within block deviations of data points from mean 


n=j 

Q{Bk) ='^2'Yk 


x n - 


j-i + 


-x- 


(24) 


Often, weights 7 k are assumed as normalizing factors for numbers of elements in 
the block, which leads to the scoring function, which we define as Qi(Bk) 

1 2 


Qi(B k ) = 77 


1 


O' - * +1) 


n=j 

E 


1 


x n - 


j -i + 1 


V=J 

E a 


(25) 
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The scoring function Q\(B k ) is a within-block variance of the data. Intuitively, 
Qi(B k ) is a reasonable measure of concentration of data points within the defined 
clusters for detecting clusters in the data. 

Other definitions of scoring functions are also possible and it seems an interesting 
issue whether the use of other scoring functions can improve data partitions, clusters 
detection and, consequently estimation of the mixture parameters. Therefore, below 
we define other scoring indexes, obtained by some modifications of Qi(B k ) (25). 

The second scoring function <32 (-£4) is a within-block, standard deviation 

Q 2 (£ fc ) = VQi(B k ). (26) 


We also use a third scoring function Q 3 (B k ) defined as a ratio of the within- 
block, standard deviation and the block sample range, 

Q 3 (B k ) = ^ Ql{ - Bk \ (27) 

Xj — Xi 

Intuitively, the scoring fucntion Q 3 {B k ) takes smaller values for blocks where data 
poins are concentrated close to the center and larger values otherwise. The property 
of the scoring function Q 3 (B k ) is that it is dimensionless and, in the large number of 
data points limit, depends only on the shape of the probability distribution function 
of the data. 

Finally, we also introduce a fourth scoring function, Q 4 (B k , A), which is a mod¬ 
ification of the scoring function Q 3 (B k ), such that some preference is given to wider 
blocks in comparison to narrower blocks. The idea of giving preference to wider 
blocks is motivated by the fact that very narrow blocks detected by Q 3 (B k ) may 
correspond to random variations of the data rather than to the true dense sub¬ 
groups related to the structure of the Gaussian components. In order to give some 
preference to wider blocks we modify Q 3 (B k ) by introducing additional parameter 
A, which leads to the scoring function Q 3 {B k , A) 


Qi(B k , A) 


A + \J Qi(B k ) 

Xj — Xi 


(28) 


Adding a positive constant A in the numerator of the above expression results in 
limiting the possibility of “shrinking” the numerator of Q 4 (B k ,A) to values very 
close to 0 , which can happen when narrow random “peaks” occur in the data. 


4.2. Scoring functions for binned data 

For the case of application of the dynamic programming method to binned data the 
appropriate modification of the expression for the scoring function Q1 is 

2 

o-3 \ 

X n ^ ( XjyWv j , (29) 

v—i / 

where w n is defined as in (18). Formulas for scoring functions Q 2 , Q 3 and Q 4 , for 
binned data, are given by (26), (27) and (28), with (25) replaced by (29). 
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4.3. Properties of different scoring functions 

Different scoring indexes may lead to different partitions of the data. Therefore, in 
the computational experiments further reported in this paper we apply and com¬ 
pare all scoring indexes Q±(Bk), Q- 2 (Bk), Qz(Bk) and Qi(Bk, A). Some preliminary 
observations (later systematically verified) are summarized below. 

The index Qi(Bk) has a tendency to over-penalize wide components, which can 
result in splitting some of the wider components into two and in merging narrow 
components into one. The index Q 2 (Bk) shows high sensitivity for the case where 
there is little overlap between components. However, when the overlap between 
components increases, it has the tendency to randomly merge components. The 
index Qs(Bk) shows advantages following from its dimensionless construction and 
often leads to correct partitions. However, it shows sensitivity to noise in the data. 
Finally, the modified index Q±{Bk, A) allows for improving the performance of 
Qz{Bk) by robustification against noise in the data. 

5. Reference methods of setting initial condition for EM iterations 

We compare dynamic programming partitions to several reference methods for set¬ 
ting initial conditions for EM iterations. These methods were already studied in the 
literature [Biernacki et al. (2003); Maitra (2009); Fraley and Raftery (1999)] and 
they were proven to be useful approaches for estimating mixture decompositions of 
datasets. The first reference method of generation of initial mixture parameters is 
the method of equal quantiles, used e.g. as the default option in the software pack¬ 
age Mclust [Fraley and Raftery (1999)]. Here bins are defined by equal quantiles 
of the dataset. Two other reference methods are hierarchical clustering algorithms, 
e.g. [Hastie et al. (2009)], where clusters of samples are created by successive op¬ 
eration of merging based on distances between samples and/or distances between 
clusters of samples. We apply two versions of hierarchical clustering, with average 
and complete linkage [Hastie et al. (2009)]. Initial values for component means, 
standard deviations and weights are defined by blocks obtained by application of 
the hierarchical clustering algorithms. 

6. Results 

We have conducted several computational experiments for systematic comparisons 
of the methods of setting initial values of parameters for EM iterations by the dy¬ 
namic programming algorithm and the three reference methods. We are using the 
following abbreviations for algorithms for setting initial conditions: E-Q - equal 
quantiles algorithm, H-clu-c - hierarchical clustering algorithm with complete link¬ 
age, H-clu-a - hierarchical clustering algorithm with average linkage, DP-Q1, DP- 
Q2, DP-Q3, DP-Q4(A) - dynamic programming algorithm with scoring function 
Ql, Q2, Q3, Q4(A). In the subsections below we first define performance criteria 
for comparing results of applying different algorithms. Then we describe two groups 
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of computational experiments, artificially created data and proteomic mass spectral 
data and we report results of comparisons of different methods for setting initial 
conditions for EM iterations. 


6.1. Performance criteria 

Performance criteria for evaluating results of parameter estimation algorithms are 
based on values of the differences between true and estimated parameters or on 
the values of the log-likelihood functions obtained in EM iterations, averaged over 
repeated experiments. Since in our constructions of the quality criteria we aim at 
making reasonable comparisons of different initialization algorithms for datasets 
corresponding to different mixtures then we need to introduce additional scalings 
and orderings of the values of differences or log-likelihoods, as described below. 


Difference between values of true and estimated parameters. The first ap¬ 
proach to evaluating results of mixtures parameter estimation algorithms is using 
a direct criterion given by a scaled difference between estimated and true values. 
We use a scaled absolute difference between true and estimated locations of com¬ 
ponents, averaged over all components. Scaling is aimed at making the distribution 
of errors invariant with respect to component widths and to components weights. 
The criterion is defined as follows 


1 K 
d = -Y 

K 


\ri rue ~ 




(30) 


»=i 


In the above expression, p t i zne , cr Zrne and a| rue are true parameters of the analyzed 
mixture distribution, K is the number of mixture components and N is the sample 
size. By /x® st we understand the value of the estimated mixture component mean 
closest to p tzue . Due to skewness of the distributions of D, we use log(Z?) as eventual 
measure of the quality of parameter estimation. 

The value of the quality criterion log(D) averaged over mixture datasets is de¬ 
noted as 


Avg[\og(D)] (31) 

and further used in reporting results of computational experiments. 

Log-likelihoods. The direct criterion defined in the previous subsection can be 
used only in the case where the true compositions of the analyzed mixtures are 
known. Since we study both types of data with known and unknown compositions of 
mixtures, we also use a second approach to evaluating results of mixtures parameter 
estimation algorithms, based on values of the log-likelihood functions. Values of the 
log-likelihoods obtained in the EM iterations can be used for scoring performance 
of different algorithms in both cases of known or unknown values of true mixture 
parameters. 
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In the case of analysis of a single dataset on can use the obtained values of 
log likelihoods to order performances of different algorithms. Higher values of log 
likelihoods imply better quality of the solution obtained by using an algorithm. 
There are exceptions from the rule “higher likelihood —» better estimate of mix¬ 
ture parameters” caused by the possible existence of the spurious local maximizers 
[McLachlan and Peel (2000)], but their influence on obscuring results of evaluations 
of performances of different algorithms is strongly reduced by the used constraints 
on values of standard deviations of mixture components. 

Often we are not interested in using values of the log likelihood functions for 
the analysis of one dataset but rather for comparisons involving many datasets. 
In that case, typically the differences between log-likelihoods obtained for differ¬ 
ent mixtures (different datasets), are much larger than differences of log-likelihoods 
resulting from applying different initialization algorithms for the same data-set. 
Therefore orderings or scalings are used in order to compensate for this. In this 
study we use the criterion applied previously in the literature [Karlis and Xekalaki 
(2003); Biernacki et al. (2003)] defined by the percentage (probability) of attaining 
“maximum” likelihood by a method of initialization of EM iterations. By “maxi¬ 
mum” likelihood we understand the maximal value over all applied algorithms. We 
also assume that “a method no to attained maximum likelihood”, means that the 
difference between “maximum” likelihood and the TO-th likelihood is lower then 5% 
of the range of all log likelihoods. The value of this criterion, estimated by averaging 
over repeated computational experiments, is denoted by 

Avg(P). (32) 


6.2. Simulated datasets 


The first computational experiment involves analyses of artificially created datasets, 
which are 10 component Gaussian mixtures, with known values of means, standard 
deviations and weights of Gaussian components. In the simulated datasets we are 
controlling the overlap between Gaussian components, due to its strong influence on 
the results of the fit. Several measures of the degree of overlap between two Gaussian 
components have been proposed in the literature, e.g., [Sun and Wang (2011)]. They 
use different approaches. A simple method is based on distances between Gaussian 
distributions (Mahalanobis, Bhattacharyya). Here we define a simple parameter ov 
for measuring the degree of overlap between neighboring components, 


ov = exp 


I Mi ~ AG+ij 

2^/crf + cr 2 + i 


(33) 


Parameter ov assumes value equal or close to zero for disjoint components and 
larger values for components of stronger overlap. Maximal possible value assumed 
by the overlap parameter is ov = 1, which occurs in the case where gi = /q+i. 
The definition of ov in (33) can be interpreted as an adaptation/simplification of 
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the idea of the Bhattacharyya distance. The contruction (33) simplifies the overlap 
definition by the Bhattacharyya distance in the sense that components with equal 
means, which show maximal possible overlap ov = 1, can be possibly distinguished 
based on the Bhattacharyya distance by differences between variances. Despite this 
simplification, the overlap measure (33) is useful for our analyzes, due to the fact 
that we are not considering mixtures with components of similar means and different 
variances (claw-like mixtures) [McLaclilan and Peel (2000)]. 

True parameters of each of the Gaussian mixtures are drawn randomly in each 
stochastic simulation experiment. Draws of parameters of the Gaussian mixtures 
are designed such that overlaps between neighboring components are constant over 
one dataset. One stochastic simulation experiment includes three steps: 

(1) Draw randomly values of variances and weights of Gaussian components 
cri, ..-,<710 and ay,Quo- 

(2) Define (i\ = 0 and compute values of means of Gaussian components, fi 2 , •••, Mio 
such that values of overlapping coefficient (33) between successive components 
has a given constant value ov. 

(3) Generate 1000 independent samples of 10-component Gaussian mixtures with 
parameters ay,..., quo, Hi,...,(aio and oy,..., oyo. 

Differences between stochastic simulation experiments involve different methods for 
drawing weights a±, ..., Quo and standard deviations ay,..., a io of Gaussian compo¬ 
nents in the mixtures and different values of the overlapping coefficient ov. Four 
groups of datasets are generated. Each includes 5 series of experiments correspond¬ 
ing, respectively, to the following 5 values of the overlapping coefficient: ov = 0.05, 
0.1, 0.15, 0.2, 0.25. Each series corresponds to one value of overlapping coefficient 
ov and includes 500 datasets, generated according to steps 1-3. Different groups use 
different scenarios for generating weights and variances of Gaussian components, 
described below. 

Group 1: Equal mixing proportions. Low variability of standard devia¬ 
tions. In this group equal values of mixing proportions are assumed, ay = 012 = 
... = quo = 0.1 and values of component standard deviations are generated ran¬ 
domly from uniform distribution 17(0.5,1) in each dataset. Values in parenthesis 
give the range of possible changes of standard deviations. So this scenario allows 
only for low (two-fold) variability of standard deviations of Gaussian components. 

Group 2: Equal mixing proportions. High variability of standard de¬ 
viations. In this group again equal values of mixing proportions are assumed, 
Oil = 02 = ••• = ot \o = 0.1 and values of component standard deviations are gener¬ 
ated randomly from uniform distribution [7(0.05,1). This scenario allows for high 
(20-fold) variability of standard deviations of Gaussian components. 
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Group 3: Different mixing proportions. Low variability of standard 
deviations. In this group different values of mixing proportions are assumed, 
ai = 1^5 > a 2 = ttio = and values of component standard deviations are 

generated randomly from uniform distribution L/(0.5,1), which corresponds to low 
(2-fold) variability of standard deviations of Gaussian components. 

Group 4: Different mixing proportions. High variability of standard 
deviations. In this group different values of mixing proportions are assumed, 
ai = ^ 5> a 2 = ^,...,ccio = 7575 and values of component standard deviations are 
generated randomly from uniform distribution 17(0.05,1), which corresponds to for 
high (20-fold) variability of standard deviations of Gaussian components. 

Comparisons of performances of different algorithms. In the computational 
experiments parameters of the Gaussian mixtures were estimated by using EM 
iterations started with each of the above described, seven algorithms of setting 
initial conditions, E-Q, H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4(A). Per¬ 
formances of algorithms are evaluated by using quality indexes Avg[log(D)\ (31), 
and Avg(P) (32). Results, obtained by averaging over 500 datasets in each of the 
data generation scenario are presented in figure 1. This figure includes 5 subplots 
arranged in columns and rows. Subplots in the two upper rows depict results of ap¬ 
plying seven algorithms of setting initial conditions in terms of Acg[log(.D)]. Each of 
the subplots corresponds to one group of datasets (experiments). In the top-left sub¬ 
plot, corresponding to the group 1 of experiments with equal weights of components 
and low variability of standard deviations of components, all initialization methods 
show high and similar performances. In the subplot (second row, left column) corre¬ 
sponding to group 3 of experiments with low variability of standard deviations and 
different component weights a method, which shows significantly lower performance 
is the equal quantiles method E-Q. Other clustering methods (H-clu-c, H-clu-a, DP- 
Ql, DP-Q2, DP-Q3, DP-Q4(A)) again show quite similar performances, however 
differences are here bigger then in the previous plot. In the subplots in the right 
column corresponding to groups 2 and 4 of stochastic simulation experiments, both 
with high variability of standard deviations of Gaussian components, we can observe 
stronger diversity between results of different initialization algorithms. The perfor¬ 
mance of the equal quantiles algorithm E-Q is high in group 2 (equal component 
weights), however in group 4 (different component weights) E-Q is the algorithm of 
the worst performance. 

In all groups algorithms based on dynamic programming method DP-Q1, DP- 
Q2, DP-Q3 and DP-Q4 either lead to similar values of Avg[\og(D)] or outperform 
reference methods. The cases where performance of hierarchical clustering method 
with average linkage can be slightly higher than dynamic programming methods 
are groups 1 and 3 (low variability of standard deviations). For groups 2 and 4 
(high variability of standard deviations) there is a strong advantage of the dynamic 
programming algorithms over the reference methods. 
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Group 1 



Correlation plot 



Avg(P) 


Group 2 



ov 


Legend: 

□ — E-Q 
V — H-clu-c 
A — H-clu-a 
O — DP-Q1 
0 — DP-Q2 
☆ — DP-Q3 
$ — DP-Q4, A=0.1 


Fig. 1. Comparisons of results of applying algorithms, E-Q, H-clu-c, H-clu-a, DP-Q1, DP-Q2, DP- 
Q3, DP-Q4(A) for estimating mixture parameters for simulated datasets. (Explanations in the 
text). 


High performance of the equal quantiles, E-Q, algorithm, in groups 1 and 2, can 
be considered as a kind of artifact. It does not follow from the high capability of 
the algorithm to locate positions of components, but rather from the fact that true 
values of component weights coincide with equal quantiles assumed in the algorithm. 

We can also make observations concerning comparisons between variants of dy¬ 
namic programming algorithms based on differetnt scoring functions. Performance 
of the algorithm DP-Q1, based on the scoring function (25) given by a sample 
variance, is high in groups 1 and 3 where the variability of component standard 
deviations is low. However, in groups 2 and 4 where the variability of component 
standard deviations strongly increases, the algorithm DP-Q1 exhibits worse perfor- 
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mance compared to other variants of dynamic programming method. This is con¬ 
sistent with the tendency of the dynamic programming partition with the scoring 
function Q1 to incorrectly merge narrow components. Performance of the algorithm 
DP-Q2 is high for low values of the overlap coefficient ov, but strongly decreases 
with the increase of ov. Performance of the algorithm DP-Q4(A), A = 0.1 is better 
than the performance of DP-Q3. 

In the bottom-left subplot we show a scatter-plot of values of indexes 
Avg[log(D )] (31) versus values of the probability index Avg(P) (32). In the scatter- 
plot strong, negative correlation between values of Aug [log (D)] and Avg(P) is seen, 
which confirms the potential of the index Avg(P) to serve as a reasonable estimate 
and a comparison tool for the performance of different algorithms. 

6.3. Protein spectra dataset 

A proteomic mass spectrum contains information about mass-to-clrarge (m/z) val¬ 
ues of registered ions, denoted by x n , versus their abundances i.e., numbers of 
counts from the ion detector denoted by y n , n denotes index of the data point. In 
real experiments the dataset consists of more than one spectrum (sample). To each 
point x n along the m/z axis correspond counts y sn , where s denotes the index of 
the sample. 

The second computational experiment in our study was the analysis of the pro¬ 
teomic dataset, which included 52 low resolution proteomic mass spectra of human 
blood plasma [Pietrowska et al. (2011)]. This dataset was obtained in the clinical 
study where blood samples were collected in the group of 22 head and neck cancer 
patients and in the group of 30 healthy donors. Raw spectra consisted of approxi¬ 
mately 45 000 m/z values, covering the range of 2 000 to 12 000 Da. Spectral signals 
were binned with the resolution 1 Da and baselines were removed with the use of a 
version of the algorithm described in [Sauve and Speed (2004)]. For further analysis 
only spectral fragments ranging from 2 000 to 4 120 Da have been selected. The 
choice of the range 2 000 to 4 120 Da was motivated by the fact that this fragment 
of m/z scale contains several protein and peptide species interesting as potential 
candidates for biomarkers. 

Comparison of performances of different algorithms Computational exper¬ 
iments on the proteomic dataset involves modeling spectral signals as mixtures of 
Gaussian components, estimating models parameters by using EM algorithm and 
comparing qualities of models obtained by using different algorithms of setting 
initial conditions for EM iterations. The quality criterion for comparing different 
initialization methods was Avg(P), where averaging is done over all spectral signals 
in the dataset. Clearly, the criterion Avg[\og(D)] cannot be used here due to the 
lack of knowledge on the true parameters of mixture models. 

For spectral signals with high variability of standard deviations, strong, differ¬ 
ent overlaps between components and large number of components, the differences 
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Avg(P) 



Legend: 

O — DP-Q1 
☆ — DP-Q3 

# — DP-Q4, A=1 
if — DP-Q4, A=5 

* — DP-Q4, A=10 




Fig. 2. Upper panel: Comparison of performances of algorithms of setting initial conditions DP- 
Ql, DP-Q3, DP-Q4(A = 1), DP-Q4(A = 5), DP-Q4(A = 10) for proteomic spectra dataset based 
on the Avg(P) index. Middle panel: Spectral signal y n = yin- Lower panel: A fragment of the 
spectrum y n = yin (black) versus its mixture model (green). 


between performances of different algorithms of initialization of the EM iterations 
are magnified compared to the simulated dataset. Application of algorithms, EQ, 
Hclu-c, Hclu-a, DP-Q1, DP-Q2, DP-Q3, DP-Q4, for the proteomic dataset, lead 
to large differences between values of log likelihood functions of models. Since 
initialization methods EQ, Hclu-c, Hclu-a and DP-Q2 exhibit significantly lower 
performance than DP-Q1, DP-Q3 and DP-Q4, then we have confined the set of 
compared algorithms to those of the highest quality, DP-Q1, DP-Q3 and DP-Q4. 
We have decomposed each of the spectral signals into Gaussian mixture. EM algo¬ 
rithm was initialized with the use of the following five algorithms: DP-Q1, DP-Q3, 
DP-Q4(A = 1), DP-Q4(A = 5), DP-Q4(A = 10). Decompositions were computed 
with numbers of components K ranging from 50 to 150. 

In figure 2, in the upper panel, we present the plot of values of the index Avg(P) 
versus the number of Gaussian components K. One can observe that, on the average, 
DP-Q4 shows higher performance than DP-Q1 and DP-Q3. High performance of 
the algorithm DP-Q4 was observed for quite wide range of the parameter A, A = 1, 
A = 5, A = 10. 

In the middle panel in figure 2 a plot of the spectral signal y n = y\ n (after 
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preprocessing operations of baseline correction and binning) corresponding to the 
sample no 1, within the range 2 000 to 4 120 Da. In the lower panel, the plot of 
a fragment of the spectrum y n = y ln (black), within the range 2 500 to 3 500 Da, 
is drawn versus its mixture model (green) obtained with the use of the algorithm 
DP-Q4 (A = 10). The number of components (K = 90) was estimated with the use 
of Bayesian information criterion [Schwarz (1978)]. One can observe a good fit of 
the computed mixture model. 


7. Discussion 

Despite simplicity of construction of the EM algorithm, the research on its per¬ 
formance and possible improvements is extensive and includes several steps of 
the algorithm: initialization, stopping conditions, preventing divergence, execution 
[McLachlan and Peel (2000)]. Modifications in each of the above steps interact one 
with another in the sense of influencing the overall performance of the algorithm. 
In this paper we have focused on initialization of EM for certain types of mixture 
distributions. We have also mentioned some solutions for stopping and preventing 
divergence. The latter of the above listed issues, concerning modifications of exe¬ 
cution of EM steps for improving its performance, is however also worth discussing 
due to its relations to the topic of our study. 

Several papers introduce modifications of E and M steps of the EM algorithm 
designed in such a way that searching through parameter space becomes more in¬ 
tensive, which can help in avoiding local maxima of the likelihood function and 
make the recursive process more robust against the choice of initial conditions e.g., 
[Zhou and Lange (2010)]. A group of approaches to enhancing performance of EM 
iterations involves multiple runs (threads) and repeats of EM steps combined with 
criteria of selecting between threads [Karlis and Xekalaki (2003); Biernacki et al. 
(2003); O’Hagan et al. (2012)]. The simplest version of short runs initiation [Karlis 
and Xekalaki (2003); Biernacki et al. (2003)] involves generating multiple initial 
values for random methods and starting EM iterations for the one corresponding 
to highest likelihood. Then only one iteration process, namely the one, which at¬ 
tained the highest value of the likelihood function is continued. A recently developed 
implementation, “burn in EM” [O’Hagan et al. (2012)], involves continuation of re¬ 
cursions of multiple threads combined with gradual elimination of the worse on the 
basis of the value of likelihood function. Several ideas of improving performance 
of EM iterations were related to modifications of the log-likelihood function corre¬ 
sponding to the Gaussian mixture model.One example of such an approach is the 
profile likelihood method in [Yao (2010)]. Introducing constraints and/or modifica¬ 
tions of the form of the likelihood function both prevent divergence of iterations 
and lead to improvement of performance of the corresponding variant of the EM 
algorithm. 

Each of the above discussed approaches can be treated as competitive to our 
algorithm in the sense that it can lead to improvement of estimation of parame- 
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ters of mixture distributions. We did not present comparisons of our method to 
the above approaches. However, according to out experience, for the type of data 
analyzed in this paper, univariate, heteroscedastic, multi-component, precise initial¬ 
ization is more important than possible improvements following from modifications 
of execution of EM iterations. We should also mention that improvements of EM 
initialization can be combined with improvements in EM execution to lead to even 
better quality of mixture parameters estimation. 


8. Conclusions 

The first conclusion of our study is that initialization methods based on dynamic 
programming, DP-Q1, DP-Q2, DP-Q3, DP-Q4(A), show advantage over the ref¬ 
erence methods EQ, Hclu-c, Hclu-a. We have compared initialization algorithms 
for a variety of mixture data with the overlap between neighboring components 
controlled by the parameter (33) and different values of variances and component 
weights (groups 1-4). This allowed for characterizing dependence of performances 
of algorithm on parameters of mixture distributions. The advantage of the dynamic 
programming initialization methods over the reference methods is highest for het¬ 
eroscedastic mixture distributions with different mixing proportions. 

The second conclusion is that performance of the dynamic programming par¬ 
titioning algorithm used for initialization of EM iterations depends on the scoring 
function used in the algorithm. We have studied several variants of the dynamic 
programming partition algorithms defined by different scoring functions (25)-(28). 
The conclusion coming from these analyzes, drawn on the basis of both Avg\Yog(D)\ 
and Avg(P) criterion was that for the type of datasets with different mixing propor¬ 
tions and high variability of standard deviations of components, the most efficient 
EM initialization method is the dynamic programming algorithm with the scoring 
function Q4. 

We have also applied the dynamic programming partition algorithms DP-Q1, 
DP-Q3 and DP-Q4 for initialization of EM iterations for estimating mixture model 
parameters for proteomic dataset including 52 low resolution mass spectral signals. 
Comparisons of values of the Avg(P) performance criterion lead to the conclusion 
that again the method of the highest efficiency is the dynamic programming parti¬ 
tion with the scoring function Q4. 

The dynamic programming method with the scoring function Q4 needs the ad¬ 
justment of the parameter A. However, computing decompositions for several values 
of A (1, 5, 10) leads to the conclusion that the algorithm shows high performance for 
quite broad range of values of A. So adjusting the value of A can be done efficiently 
and robustly. 

The dynamic programming algorithm applied to the partitioning problem has 
a quadratic computational complexity with respect to the number of elements of 
vector of observations x. Despite computational load, the advantage of using dy¬ 
namic programming method for initialization of the EM algorithm is the quality of 
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the obtained mixture model. In the majority of applications of mixture models the 
quality of the model is more important than the computational complexity of the 
algorithm used for the computations. 


Acknowledgments 

This paper was partially financially supported by the scientific projects from 
the Polish National Center for Science (NCN) and the Polish National Center 
for Research and Development (NCBIR). JP was supported by NCN Harmonia 
grant DEC-2013/08/M/ST6/00924, AP was supported by NCN Opus grant UMO- 
2011/01/B/ST6/06868, MM was supported by NCBiR grant POIG.02.03.01-24- 
099/13. Computations were performed with the use of the infrastructure provided 
by the NCBIR POIG.02.03.01-24-099/13 grant: GeCONil - Upper Silesian Center 
for Computational Science and Engineering. 


Supplementary materials 

Supplementary materials are Matlab scripts and functions for performing compar¬ 
isons of partitioning algorithms E-Q, H-clu-c, H-clu-a, DP-Q4 for the data described 
as Group 4 in section 6.2. Demo computations are started by launching Matlab 
script partitions-em-demo. One stochastic simulation experiment is performed (in¬ 
cluding three steps 1-3 listed in section 6.2). Results of computations are shown 
by plots of partitions and data histograms versus estimated probability density 
functions. Values of errors Au<?[log(.D)] and likelihoods are also reported. By mod¬ 
ifications of the Matlab code other computational scenarios for simulated data can 
be also easily realized. 
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